{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>Creating a searchable Art Database with The MET's open-access collection</h1>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we show how to enrich data using Cognitive Skills and write the results to an Azure Search index using MMLSpark. We take a subset of The MET's open-access collection and enrich it by passing it through the 'Describe Image' skill and a custom 'Image Similarity' skill. The enriched results are then written to a searchable index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os, sys, time, json, requests\n",
    "from pyspark.ml import Transformer, Estimator, Pipeline\n",
    "from pyspark.ml.feature import SQLTransformer\n",
    "from pyspark.sql.functions import lit, udf, col, split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "VISION_API_KEY = os.environ['VISION_API_KEY']\n",
    "AZURE_SEARCH_KEY = os.environ['AZURE_SEARCH_KEY']\n",
    "search_service = \"mmlspark-azure-search\"\n",
    "search_index = \"test\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data = spark.read\\\n",
    "  .format(\"csv\")\\\n",
    "  .option(\"header\", True)\\\n",
    "  .load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv\")\\\n",
    "  .withColumn(\"searchAction\", lit(\"upload\"))\\\n",
    "  .withColumn(\"Neighbors\", split(col(\"Neighbors\"), \",\").cast(\"array<string>\"))\\\n",
    "  .withColumn(\"Tags\", split(col(\"Tags\"), \",\").cast(\"array<string>\"))\\\n",
    "  .limit(25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworkSamples.png\" width=\"800\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from mmlspark.cognitive import DescribeImage\n",
    "from mmlspark.stages import SelectColumns\n",
    "\n",
    "# Define the enrichment pipeline: call the Describe Image API on each image URL\n",
    "describeImage = DescribeImage()\\\n",
    "                .setSubscriptionKey(VISION_API_KEY)\\\n",
    "                .setUrl(\"https://eastus.api.cognitive.microsoft.com/vision/v2.0/describe\")\\\n",
    "                .setImageUrlCol(\"PrimaryImageUrl\")\\\n",
    "                .setOutputCol(\"RawImageDescription\")\\\n",
    "                .setConcurrency(5)\\\n",
    "                .setMaxCandidates(5)\n",
    "\n",
    "# Extract the caption texts from the raw API response\n",
    "getDescription = SQLTransformer(statement=\"SELECT *, RawImageDescription.description.captions.text as ImageDescriptions \\\n",
    "                                FROM __THIS__\")\n",
    "\n",
    "# Extract the tags; note that ImageTags is not in the column list below, so it is dropped before indexing\n",
    "getTags = SQLTransformer(statement=\"SELECT *, RawImageDescription.description.tags as ImageTags FROM __THIS__\")\n",
    "\n",
    "# Keep only the columns that match the fields of the search index\n",
    "finalcols = SelectColumns().setCols(['ObjectID', 'Department', 'Culture', 'Medium', 'Classification', 'PrimaryImageUrl',\\\n",
    "                                     'Tags', 'Neighbors', 'ImageDescriptions', 'searchAction'])\n",
    "\n",
    "data_processed = Pipeline(stages = [describeImage, getDescription, getTags, finalcols])\\\n",
    "                    .fit(data).transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworksProcessed.png\" width=\"800\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before writing the results to a Search index, you must define a schema that specifies the name, type, and attributes of each field in the index. Refer to [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "index_dict = { \"name\" : search_index, \n",
    "               \"fields\" : [\n",
    "                  {\n",
    "                       \"name\": \"ObjectID\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"key\": True, \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"Department\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  }, \n",
    "                  {\n",
    "                       \"name\": \"Culture\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"Medium\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"Classification\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"PrimaryImageUrl\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },                   \n",
    "                  {\n",
    "                       \"name\": \"Tags\", \n",
    "                       \"type\": \"Collection(Edm.String)\", \n",
    "                       \"facetable\": False\n",
    "                  },                  \n",
    "                  {\n",
    "                       \"name\": \"Neighbors\", \n",
    "                       \"type\": \"Collection(Edm.String)\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"ImageDescriptions\", \n",
    "                       \"type\": \"Collection(Edm.String)\", \n",
    "                       \"facetable\": False\n",
    "                  }\n",
    "              ]}\n",
    "\n",
    "index_str = json.dumps(index_dict)\n",
    "\n",
    "options = {\n",
    "            \"subscriptionKey\" : AZURE_SEARCH_KEY,\n",
    "            \"actionCol\" : \"searchAction\",\n",
    "            \"serviceName\" : search_service,\n",
    "            \"indexJson\" : index_str\n",
    "          }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from mmlspark.io.http import *\n",
    "data_processed.writeToAzureSearch(options)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Search index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests whose query parameters give the criteria for selecting matching documents. For more information on querying, refer to [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "post_url = 'https://%s.search.windows.net/indexes/%s/docs/search?api-version=2017-11-11' % (search_service, search_index)\n",
    "\n",
    "# Reuse the key loaded from the environment earlier in the notebook\n",
    "headers = {\n",
    "    \"Content-Type\":\"application/json\",\n",
    "    \"api-key\": AZURE_SEARCH_KEY\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Search the Classification field for \"Glass\" and return the top 3 matches\n",
    "body = {\n",
    "    \"search\": \"Glass\",\n",
    "    \"searchFields\": \"Classification\",\n",
    "    \"select\": \"ObjectID, Department, Culture, Medium, Classification, PrimaryImageUrl, Tags, ImageDescriptions\",\n",
    "    \"top\": 3\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "response = requests.post(post_url, json.dumps(body), headers = headers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "response.json()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
